Note
In this class, we will use R Notebooks to intermix code with our documents. The document part of an R Notebook uses the R Markdown formatting convention. You can find cheat-sheets here.
Generally, your workflow will look like:
Create an RStudio project. It will ask you for a directory where you want to keep your work. I suggest you put this on a cloud-synced folder using a service like Dropbox or make regular backups. If you do use cloud storage, rendering notebooks will generate a lot of synchronization events when doing analysis. I find this not too cumbersome with Dropbox, but services like Box perform very badly in this case. You might need to pause synchronization or attempt to lower the run priority of the synchronization daemon. You can also enable version control on your project, which will allow you to do point-in-time recovery of earlier versions of your code and synchronize your repository with a remote service like GitHub. RStudio has a nice wrapper for using git. You might have a look here.
Create a new workbook by going to File > New File > R Notebook. It will be created with some example markup for you to study.
Edit the bits at the top between the --- statements to reflect the title, etc. This header section is formatted in YAML (some more information is here.)
You may want to enable chunk caching as described here. This will avoid re-running code every time, but can lead to code that does not update. You can clear the cache manually in that case.
Write a bit of markdown text to describe what you are doing.
```{r}Your code```
(those are back-ticks). The r designates R as the language. You can use other languages like python as well!
When your notebook is finished, optionally commit your changes and push to the cloud.
R can make lovely maps with a bit of patience and tweaking. There are huge range of options for reading, writing and plotting geospatial data in R. You can find a summary of many packages in the spatial task-view on CRAN.
Note
First we need to load some extensions or “packages” into R. I have started using the p_load function from the pacman package because it will automatically install the package if it is missing. But first you need to install the packman package. You can do this by going to Tools > Install Packages... and then entering pacman in the RStudio menus. If you have a better/newer way to do this, let me know! Note than when I call the p_load function, I preface it with the pacman:: namespace designation. This calls the function without loading the package.
tryCatch(library(pacman), error = function(...)
{
install.packages("pacman")
})
pacman::p_load(tidyverse) # An increadibly useful array of packages that you will probably always want to use
pacman::p_load(rgeos) # GIS library
pacman::p_load(maps) # An oldie-but-goodie, the maps package contains some mapping data we can use
pacman::p_load(mapproj) # Allows map reprojections
pacman::p_load(maptools) # Another package that contains map data and other utilities
pacman::p_load(broom) # Utilities for tidying data
pacman::p_load(sf) # The simple features library
pacman::p_load(ggmap) # Really nice package for mapping
gpclibPermit() # Very strange handling of licensing issue
## [1] FALSE
Note
R has extensive, built-in documentation. You can go to the Help tab in the lower right panel of RStudio to search and read help. In the console, ??<topic> will search for topics, include similar sounding topics. ?<topic> will search for a specific help page. I often use <topic> site:r-project.org in Google as well.
The tidyverse collection include the plotting package ggplot2, which has some basic facilities for mapping data. First we will get some outline data of world continents and look at the structure of the object. To learn more about the tidyverse, visit the R for Data Science book.
world = map_data('world') # this ggplot2 function gets data from the maps package
str(world) # the str function dumps the internals of an R object
## 'data.frame': 99338 obs. of 6 variables:
## $ long : num -69.9 -69.9 -69.9 -70 -70.1 ...
## $ lat : num 12.5 12.4 12.4 12.5 12.5 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ subregion: chr NA NA NA NA ...
The data are stored in a table format called a data.frame. Data frames are the fundamental tabular type on which most of R analytics rests. This one has six columns and nearly 100,000 rows. Each row in this case is a latitude and longitude with some ancillary information indicating which polygon each point belongs to. Ggplot2 uses this information to render the polygons.
Note
Ggplot2 works by constructing a pipeline. Each subsequent command either adds a layer of output to the plot or modifies existing layers. Every pipeline starts with the constructor ggplot and then adds items using +. Somewhere, the aesthetics of the plot must be specified. You use the aes command for that. Generally, you will need at least aes(x = <x-var>, y = <y-var>) to make a plot.
ggplot(world) + # create a ggplot2 object; the + operator appends following objects to make the full plot
geom_polygon(aes(x = long, y = lat, group = group)) + # this tells ggplot to make polygons
coord_map() # coord_map sets up the axes to be a map plot
There are a couple of problems with this map. First, it is clear that there are polygons that cross \(\pm\) 180 degrees longitude and that is causing the drawing routine to make a streak across the top. This is a common problem when working with global polygons using a Cartesian drawing system. If you see this, you will need to change the data set or the rendering software. The second problem is that the default projection is Mercator, which distorts high- and low- latitude regions. Lets redraw using a good global projection, the moleweide project. Ggplot2 knows how to call the mapproj package to reproject the coordinates.
ggplot(world, aes(x = long, y = lat, )) +
geom_polygon(aes(group = group, fill = group)) +
coord_map("mollweide") + # reproject
theme_minimal() + # ggplot uses themes to control output
guides(fill = FALSE) # This is actually quite important
Here we added some color by specifying the fill attribute to geom_polygon and used the group factor to choose the colors. The default color map is not too keen, but we can fix that. The best colors come from the RColorBrewer package and ggplot2 knows how to use it.
data(wrld_simpl) # from the maptools package
w = tidy(wrld_simpl) # make this something that ggplot2 can work with
## Regions defined for each Polygons
str(w)
## 'data.frame': 26264 obs. of 7 variables:
## $ long : num -61.7 -61.9 -61.8 -61.7 -61.7 ...
## $ lat : num 17 17.1 17.2 17 17.6 ...
## $ order: int 1 2 3 4 5 6 7 8 1 2 ...
## $ hole : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ piece: Factor w/ 475 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 1 1 ...
## $ group: Factor w/ 3768 levels "ATG.1","ATG.2",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ id : chr "ATG" "ATG" "ATG" "ATG" ...
The tidy command converted the map data to a data frame and added an indicator group to designate the polygons. There is also information about whether a given polygon is an outline or a hole. Holes are usually drawn empty with the background color showing through. Of course, one can have holes within holes, as would happen with an island in a lake in a continent. Most GIS will try to account for this hierarchy, but not all system do it or do it correctly.
ggplot(w, aes(x = long, y = lat)) +
geom_polygon(aes(fill = group)) +
coord_map("vandergrinten") +
guides(fill = FALSE) +
theme_minimal() +
ylim(-55, 85) # limit latitude a bit
The default colors are pretty ugly. For better colors, we can convert the fill to a number and use the scale_fill_distiller command. We will also use the sample command to randomize the colors each time. The sample function will randomly permute the input vector, e.g., sample(1:5) == c(5, 1, 2, 4, 3). The rgb command builds a color with different red, green and blue intensities. The fourth argument to rgb sets transparency.
ggplot(w, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = sample(as.integer(group))),
color = rgb(1, 1, 1, 0.1)) + # color sets outlines
scale_fill_distiller(palette = "Set1") + # RColorBrewer palette
coord_map("sinusoidal") +
guides(fill = FALSE) +
theme_minimal() +
ylim(-55, 85)
Projections are fun. You might want to write a function that takes the projection string as an argument and make some more plots. Here we use the ... arguments to pass optional arguments to the coord_map function. This allows you to use projections that require extra parameters. This function returns a ggplot2 object, which then gets printed to the console by default. The print command for ggplot2 objects is overridden to draw a graphic. You can learn more about this by studying S3 object methods. Notice here that we set the group aesthetic to designate the polygon membership, and then use the permuted integer values of group for colors.
world_proj = function(projection, ...) # The result of the final expression in the function is returned
{
ggplot(w, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = sample(as.integer(group))),
color = rgb(1, 1, 1, 0.1)) + # color is NOT an aesthetic
scale_fill_distiller(palette = "Set1") +
coord_map(projection, ...) + # ... is replaced dynamically
guides(fill = FALSE) +
theme_minimal() +
ylim(-55, 85)
}
world_proj("gall", 0) # zero passed to the coord_map function
The sf library stores and manipulates OGC “Simple Features”. This is a widely established standard for working with GIS data. You can read more about it here or search the web. An introduction to sf is here. We will be using the sf library for working with vector data: points, lines, polygons, etc.
nc <- st_read(system.file("shape/nc.shp", package="sf")) # Read a shapefile included with the sf package
## Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/3.4/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## epsg (SRID): 4267
## proj4string: +proj=longlat +datum=NAD27 +no_defs
The big change between the sf package and the old sp package is that simple features are stored independently in columns of a data frame. This make working with simple features much easier. You can see below that there is a column in the data frame named geometry. That holds the simple features. And in this case it is a set of MULTIPOLYGON simple features.
Note
The str function is really useful on the command line, but do not use it when writing software or coding analyses. There are other ways to get information about objects. str is just for taking a quick look. You might also have a look at the glimpse function in tidyverse.
str(nc) # internal structure
## Classes 'sf' and 'data.frame': 100 obs. of 15 variables:
## $ AREA : num 0.114 0.061 0.143 0.07 0.153 0.097 0.062 0.091 0.118 0.124 ...
## $ PERIMETER: num 1.44 1.23 1.63 2.97 2.21 ...
## $ CNTY_ : num 1825 1827 1828 1831 1832 ...
## $ CNTY_ID : num 1825 1827 1828 1831 1832 ...
## $ NAME : Factor w/ 100 levels "Alamance","Alexander",..: 5 3 86 27 66 46 15 37 93 85 ...
## $ FIPS : Factor w/ 100 levels "37001","37003",..: 5 3 86 27 66 46 15 37 93 85 ...
## $ FIPSNO : num 37009 37005 37171 37053 37131 ...
## $ CRESS_ID : int 5 3 86 27 66 46 15 37 93 85 ...
## $ BIR74 : num 1091 487 3188 508 1421 ...
## $ SID74 : num 1 0 5 1 9 7 0 0 4 1 ...
## $ NWBIR74 : num 10 10 208 123 1066 ...
## $ BIR79 : num 1364 542 3616 830 1606 ...
## $ SID79 : num 0 3 6 2 3 5 2 2 2 5 ...
## $ NWBIR79 : num 19 12 260 145 1197 ...
## $ geometry :sfc_MULTIPOLYGON of length 100; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:27, 1:2] -81.5 -81.5 -81.6 -81.6 -81.7 ...
## ..- attr(*, "class")= chr "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr "AREA" "PERIMETER" "CNTY_" "CNTY_ID" ...
Making a plot of a data frame with simple features is pretty easy.
plot(nc, max.plot = 14)
By default, every non-simple-feature column was used to color the polygons and a separate map generated for each. The sf package chooses a default palette for continuous and categorical maps. Notice that the factors (see the output of str above), R’s type for categories, plotted with a different color scheme.
The ggmap package can produce remarkably nice maps. It does this by pulling map data from service like Google Maps, which you can then render locally. What is really nice is that ggmap can exploit Google’s geocoding service to look up place-names.
utmap = get_map("University of Texas at Austin")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=University+of+Texas+at+Austin&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=University%20of%20Texas%20at%20Austin&sensor=false
tryCatch(ggmap(utmap), error = function(...)
{
devtools::install_github("dkahle/ggmap") # get dev version if needed
ggmap(utmap)
})
Now lets load some data to overlay. I will use the sf package to load zip-code polygons for the Austin area. (I have conveniently stashed them with this tutorial.) One rather non-intuitive and unfortunate oversight in the sf package is that it requires you to specify a layer name in addition ot the file name. This is because GIS formats usually can contain multiple layers. It probably should “just work”, but does not, so you have to examine the input file to find the layers. I happen to know the layer name in this data file.
Also, sf is pretty new, so it might throw some errors. In the following, I set the map projection manually. You can find out about map projections and EPSG codes at epsg.io.
zipcodes = read_sf("Zipcodes.geojson", "OGRGeoJSON")
st_crs(zipcodes) = 4326 # Set the map projection to WGS84
str(zipcodes)
## Classes 'sf', 'tbl_df', 'tbl' and 'data.frame': 80 obs. of 12 variables:
## $ created_by: chr "" "" "" "" ...
## $ modified_b: chr "" "" "" "" ...
## $ zipcode : chr "78754" "76530" "78739" "78645" ...
## $ name : chr "AUSTIN" "GRANGER" "AUSTIN" "LEANDER" ...
## $ shape_area: chr "369136949.49504" "2916191810.61907" "338340541.543711" "1043395959.57919" ...
## $ created_da: chr "" "" "" "" ...
## $ objectid : chr "1" "2" "3" "4" ...
## $ geodb_oid : chr "1" "2" "3" "4" ...
## $ modified_d: chr "" "" "" "" ...
## $ shape_len : chr "98511.3056366303" "300079.66411114" "97785.7933337692" "196654.442799695" ...
## $ zipcodes_i: chr "54" "55" "56" "57" ...
## $ geometry :sfc_MULTIPOLYGON of length 80; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:372, 1:2] -97.6 -97.6 -97.6 -97.6 -97.6 ...
## ..- attr(*, "class")= chr "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr "created_by" "modified_b" "zipcode" "name" ...
plot(zipcodes, max.plot = 11) # all attribute columns plotted
Now with ggmap. We have to use the tidy command from ggplot2 to convert from the sf format to a regular data frame with latitude and longitude coordinates. We do this by first converting to the old sp format and then evoking the tidy command. When tidying map data, you need to specify the grouping variable, which in this case is zipcode. In the plot below, I use alpha to overlay transparent fills.
zip = tidy(as(zipcodes, "Spatial"), region = "zipcode") # conver to sp and then tidy
ggmap(utmap) +
geom_polygon(data = zip,
aes(x = long, y = lat, group = group, fill = group),
color = rgb(1, 0, 0, 0.2), alpha = 0.1) +
guides(fill = FALSE)
The
get_map function takes a zoom argument where 3 is continent and 21 is building-scale. You can also specify different map types, like satellite or terrain.
utmap = get_map("University of Texas at Austin", zoom = 16, maptype = "satellite")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=University+of+Texas+at+Austin&zoom=16&size=640x640&scale=2&maptype=satellite&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=University%20of%20Texas%20at%20Austin&sensor=false
ggmap(utmap) +
geom_polygon(data = zip,
aes(x = long, y = lat, group = group, fill = sample(group)),
color = rgb(1, 0, 0, 0.2), alpha = 0.1) +
annotate("text", y = 30.285174, x = -97.735462, color = "white",
label = "We are here") +
guides(fill = FALSE)
maps and maptools packages and find data mapping other regions.sf library. [Hint: use ?plot.sf to get help with the sf version of plot.]ggmap to plot a map of texas with county boundaries.